
Genome Research

Cold Spring Harbor Laboratory

All preprints, ranked by how well they match Genome Research's content profile, based on 409 papers previously published here. The average preprint has a 0.15% match score for this journal, so anything above that is already an above-average fit. Older preprints may already have been published elsewhere.

1
Quartet-based Genome-scale Species Tree Inference using Multicopy Gene Family Trees

Rafi, A.; Rumi, A. M. S.; Hakim, S. A.; Bayzid, M. S.

2025-04-10 evolutionary biology 10.1101/2025.04.04.647228 bioRxiv
Top 0.1%
33.6%

Species tree estimation from multi-copy gene family trees, including both paralogs and orthologs, is a challenging task due to the gene tree discordance caused by biological processes such as incomplete lineage sorting (ILS) and gene duplication and loss (GDL). Quartet-based species tree estimation methods, such as ASTRAL, Quartet Max-Cut (QMC), and the Quartet Fiduccia-Mattheyses (QFM) framework, have gained substantial popularity for their accuracy and statistical guarantees. However, most of these methods rely on single-copy gene trees and model only ILS, which limits their applicability to large genomic datasets. ASTRAL-Pro incorporates both orthology and paralogy for species tree inference under GDL by employing a refined quartet similarity measure based on the concept of species-driven quartets (SQs). In this study, we show that these SQ-based techniques can be effectively leveraged within the QFM framework. This required substantial algorithmic re-engineering, including the development of efficient techniques for computing the initial bipartition in QFM and novel combinatorial methods for computing refined quartet scores directly from gene family trees. We extensively evaluated our method, wQFM-GDL, on benchmark simulated and real biological datasets and compared it with ASTRAL-Pro3, SpeciesRax, and DupLoss-2. wQFM-GDL outperforms all other methods in 113 out of 124 model conditions considered in this study, with performance differences becoming more pronounced as dataset size increases. In particular, for larger datasets with 200 and 500 taxa, wQFM-GDL significantly outperforms all leading methods in all 72 model conditions and achieves, on average, nearly a 25% reduction in reconstruction error compared with ASTRAL-Pro3. wQFM-GDL is freely available in open-source form at https://github.com/abdur-rafi/wQFM-GDL.
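The quartet reasoning these methods share can be made concrete with the four-point condition: in an additive tree, the two leaves of any quartet whose cross-distance sum is smallest are paired together. A minimal Python sketch of scoring a candidate species tree against gene trees by quartet agreement (illustrative only; the helper names are ours, and this is not the wQFM-GDL algorithm):

```python
from itertools import combinations

def dmat(pairs):
    # Build a symmetric leaf-to-leaf distance lookup from {(x, y): d} pairs.
    d = {}
    for (x, y), v in pairs.items():
        d.setdefault(x, {})[y] = v
        d.setdefault(y, {})[x] = v
    return d

def quartet_topology(d, a, b, c, e):
    # Four-point condition: the tree pairs the two leaves whose
    # cross-distance sum is smallest, giving ab|ce, ac|be, or ae|bc.
    sums = {
        frozenset({frozenset({a, b}), frozenset({c, e})}): d[a][b] + d[c][e],
        frozenset({frozenset({a, c}), frozenset({b, e})}): d[a][c] + d[b][e],
        frozenset({frozenset({a, e}), frozenset({b, c})}): d[a][e] + d[b][c],
    }
    return min(sums, key=sums.get)

def quartet_score(species_d, gene_ds, taxa):
    # Fraction of (gene tree, quartet) combinations whose induced
    # topology agrees with the candidate species tree.
    agree = total = 0
    for a, b, c, e in combinations(taxa, 4):
        ref = quartet_topology(species_d, a, b, c, e)
        for gd in gene_ds:
            agree += quartet_topology(gd, a, b, c, e) == ref
            total += 1
    return agree / total
```

On four taxa with topology ab|cd, a gene tree with the same topology scores 1.0 against it, and a discordant ac|bd gene tree pulls the average down, which is the signal quartet-based methods aggregate over all quartets.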

2
mm2-gb: GPU Accelerated Minimap2 for Long Read DNA Mapping

Dong, J.; Liu, X.; Sadasivan, H.; Sitaraman, S.; Narayanasamy, S.

2024-03-27 genomics 10.1101/2024.03.23.586366 bioRxiv
Top 0.1%
32.5%

Long-read DNA sequencing is becoming increasingly popular for genetic diagnostics. Minimap2 is the state-of-the-art long-read aligner. However, Minimap2's chaining step is slow on the CPU, taking 40-68% of total runtime, especially for long DNA reads. Prior works accelerating Minimap2 either lose mapping accuracy, are closed source (and not updated), or deliver inconsistent speedups for longer reads. We introduce mm2-gb, which accelerates the chaining step of Minimap2 on GPU without compromising mapping accuracy. In addition to the intra- and inter-read parallelism exploited by prior works, mm2-gb exploits finer levels of parallelism by breaking high-latency large workloads into smaller independent segments that can be run in parallel, and leverages several strategies for better workload balancing, including split kernels and prioritized scheduling of segments by sorted size. We show that mm2-gb on an AMD Instinct MI210 GPU achieves a 2.57-5.33x performance improvement on long nanopore reads (10kb-100kb), and up to a 1.87x performance gain on super-long reads (100kb-300kb), compared to SIMD-accelerated mm2-fast. mm2-gb is open-sourced and available at https://github.com/Minimap2onGPU/mm2-gb.
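The chaining step in question is, at its core, an O(n²) dynamic program over seed anchors, which is what makes it both slow on CPUs and amenable to GPU parallelism. A simplified sequential sketch (fixed match score and linear gap cost; minimap2's actual scoring function is more elaborate):

```python
def chain_score(anchors, match=1, gap_cost=0.01):
    # O(n^2) chaining DP over (reference_pos, query_pos) anchors:
    # best[i] = best score of any chain ending at anchor i.
    anchors = sorted(anchors)                # process in reference order
    best = [0.0] * len(anchors)
    for i, (xi, yi) in enumerate(anchors):
        best[i] = match                      # chain of just anchor i
        for j in range(i):
            xj, yj = anchors[j]
            if xj < xi and yj < yi:          # j may precede i in a chain
                gap = abs((xi - xj) - (yi - yj))
                cand = best[j] + match - gap_cost * gap
                if cand > best[i]:
                    best[i] = cand
    return max(best, default=0.0)
```

Each `best[i]` depends on all earlier anchors, so naive parallelization stalls on long reads with many anchors; splitting the anchor array into independent segments, as mm2-gb does, is what exposes the finer-grained parallelism described above.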

3
Accelerating Identification of Chromatin Accessibility from noisy ATAC-seq Data using Modern CPUs

Chaudhary, N.; Misra, S.; Kalamkar, D.; Heinecke, A.; Georganas, E.; Ziv, B.; Adelman, M.; Kaul, B.

2021-09-30 genomics 10.1101/2021.09.28.462099 bioRxiv
Top 0.1%
27.8%

Identifying accessible chromatin regions is a fundamental problem in epigenomics, with ATAC-seq being a commonly used assay. The exponential rise in single-cell ATAC-seq experiments has made it critical to accelerate processing of ATAC-seq data. ATAC-seq data can have a low signal-to-noise ratio for various reasons, including low coverage or low cell count. To denoise and identify accessible chromatin regions from noisy ATAC-seq data, deep learning on 1D data - using large filter sizes, long tensor widths, and/or dilation - has recently been proposed. Here, we present ways to accelerate the end-to-end training performance of these deep learning-based methods using CPUs. We evaluate our approach on the recently released AtacWorks toolkit. Compared to an Nvidia DGX-1 box with 8 V100 GPUs, we achieve up to a 2.27x speedup using just 16 CPU sockets. To achieve this, we build an efficient 1D dilated convolution layer and demonstrate reduced-precision (BFloat16) training.
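The primitive at the heart of this work, a 1D dilated convolution, reads input taps spaced `dilation` positions apart, widening the receptive field without extra weights. A pure-Python reference version (illustrative only, not the optimized CPU kernel from the paper):

```python
def dilated_conv1d(signal, kernel, dilation=1):
    # 'Valid' 1-D convolution with dilated taps:
    # output[i] = sum over k of kernel[k] * signal[i + k * dilation].
    span = (len(kernel) - 1) * dilation      # input positions one output covers
    return [
        sum(kernel[k] * signal[i + k * dilation] for k in range(len(kernel)))
        for i in range(len(signal) - span)
    ]
```

With `dilation=2` a two-tap kernel skips every other sample, so a stack of such layers covers long genomic windows cheaply, which is why dilation appears alongside large filters in these denoising models.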

4
10-minimizers: a promising class of constant-space minimizers

Shur, A.; Tziony, I.; Orenstein, Y.

2026-03-18 bioinformatics 10.64898/2026.03.16.712052 bioRxiv
Top 0.1%
25.8%

Minimizers are sampling schemes that are ubiquitous in almost any high-throughput sequencing analysis. Assuming a fixed alphabet of size σ, a minimizer is defined by two positive integers k, w and a linear order ρ on k-mers. A sequence is processed by a sliding-window algorithm that chooses in each window of length w + k - 1 its minimal k-mer with respect to ρ. A key characteristic of a minimizer is its density, the expected frequency of chosen k-mers among all k-mers in a random infinite σ-ary sequence. Minimizers of smaller density are preferred, as they produce smaller samples, which reduce runtime and memory usage in downstream applications. Recent studies developed methods to generate minimizers with optimal and near-optimal densities, but these require explicitly storing k-mer ranks in Ω(2^k) space. While constant-space minimizers exist, and some of them are provably asymptotically optimal, no constant-space minimizer has been proven to guarantee lower density than a random minimizer in the non-asymptotic regime, and many minimizer schemes suffer from long k-mer key-retrieval times due to complex computation. In this paper, we introduce 10-minimizers, a class of minimizers with promising properties. First, we prove that for every k > 1 and every w ≥ k - 2, a random 10-minimizer has, in expectation, lower density than a random minimizer. This is the first provable guarantee for a class of minimizers in the non-asymptotic regime. Second, we present spacers, particular 10-minimizers combining three desirable properties: they are constant-space, low-density, and have small k-mer key-retrieval time. In terms of density, spacers are competitive with the best known constant-space minimizers; in certain (k, w) regimes they achieve the lowest density among all known (not necessarily constant-space) minimizers. Notably, we are the first to benchmark constant-space minimizers on the time spent in k-mer key retrieval, the most fundamental operation in many minimizer-based methods. Our empirical results show that spacers retrieve k-mer keys in competitive time (a few seconds per genome-size sequence, less than random minimizers require) for all practical values of k and w. We expect 10-minimizers to improve minimizer-based methods, especially those using large window sizes. We also propose the k-mer key-retrieval benchmark as a standard objective for any new minimizer scheme.
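The sliding-window scheme described above is easy to state in code. A sketch of a random minimizer, with the order ρ given by a hash (our own illustration, not the 10-minimizer or spacer construction):

```python
import hashlib

def rho(kmer):
    # A pseudo-random linear order on k-mers via hashing
    # (this is what makes it a "random minimizer").
    return hashlib.blake2b(kmer.encode(), digest_size=8).digest()

def minimizer_positions(seq, k, w):
    # Slide every window of w consecutive k-mers (length w + k - 1)
    # and keep the position of its rho-minimal k-mer.
    # Density is len(result) / number of k-mers in seq.
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked = set()
    for s in range(len(kmers) - w + 1):
        picked.add(min(range(s, s + w), key=lambda i: rho(kmers[i])))
    return sorted(picked)
```

Because every window contributes a pick, consecutive selected positions are at most w apart, which is the window guarantee that lets downstream tools recover sufficiently long matches from the sample.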

5
One Cell At a Time: A Unified Framework to Integrate and Analyze Single-cell RNA-seq Data

Wang, C. X.; Zhang, L.; Wang, B.

2021-07-16 genomics 10.1101/2021.05.12.443814 bioRxiv
Top 0.1%
25.7%

The surge of single-cell RNA sequencing technologies gives rise to an abundance of large single-cell RNA-seq datasets at the scale of hundreds of thousands of single cells. Integrative analysis of large-scale scRNA-seq datasets has the potential of revealing de novo cell types as well as aggregating biological information. However, most existing methods fail to integrate multiple large-scale scRNA-seq datasets in a computationally and memory-efficient way. We hereby propose OCAT, One Cell At a Time, a graph-based method that sparsely encodes single-cell gene expression to integrate data from multiple sources without most-variable-gene selection or explicit batch effect correction. We demonstrate that OCAT efficiently integrates multiple scRNA-seq datasets and achieves state-of-the-art performance in cell type clustering, especially in challenging scenarios of non-overlapping cell types. In addition, OCAT efficaciously facilitates a variety of downstream analyses, such as differential gene analysis, trajectory inference, pseudotime inference and cell inference. OCAT is a unifying tool to simplify and expedite the analysis of large-scale scRNA-seq data from heterogeneous sources.

6
Correlation imputation in single cell RNA-seq using auxiliary information and ensemble learning

Gan, L.; Vinci, G.; Allen, G. I.

2020-09-04 genomics 10.1101/2020.09.03.282178 bioRxiv
Top 0.1%
25.1%

Single-cell RNA sequencing is a powerful technique that measures the gene expression of individual cells in a high-throughput fashion. However, due to sequencing inefficiency, the data suffers from dropout events: technical artifacts in which genes erroneously appear to have zero expression. Many data imputation methods have been proposed to alleviate this issue. Yet, effective imputation can be difficult and biased because the data is sparse and high-dimensional, resulting in major distortions in downstream analyses. In this paper, we propose a completely novel approach that imputes the gene-by-gene correlations rather than the data itself. We call this method SCENA: Single cell RNA-seq Correlation completion by ENsemble learning and Auxiliary information. The SCENA gene-by-gene correlation matrix estimate is obtained by model stacking of multiple imputed correlation matrices based on known auxiliary information about gene connections. In an extensive simulation study based on real scRNA-seq data, we demonstrate that SCENA not only accurately imputes gene correlations but also outperforms existing imputation approaches in downstream analyses such as dimension reduction, cell clustering, and graphical model estimation.

7
GeneFriends 2021: Updated co-expression databases and tools for human and mouse genes and transcripts

Raina, P.; Lopes, I.; Chatsirisupachai, K.; Farooq, Z.; de Magalhaes, J. P.

2021-01-10 genomics 10.1101/2021.01.10.426125 bioRxiv
Top 0.1%
23.2%

Gene co-expression analysis has emerged as a powerful method to provide insights into gene function and regulation. The rapid growth of publicly available RNA-sequencing (RNA-seq) data has created opportunities for researchers to employ this abundant data to help decipher the complexity and biology of genomes. Co-expression networks have proven effective for inferring relationships between genes, for gene prioritization, and for assigning function to poorly annotated genes based on their co-expressed partners. To facilitate such analyses we previously created an online co-expression tool for humans and mice entitled GeneFriends. To continue providing a valuable tool to the scientific community, we have updated the GeneFriends database. Here, we present the latest version of GeneFriends, which includes updated gene and transcript co-expression networks based on RNA-seq data from 46,475 human and 34,322 mouse samples. GeneFriends is freely available at http://www.genefriends.org/
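In its simplest form, a co-expression network links gene pairs whose expression profiles correlate strongly across samples. A toy sketch (GeneFriends builds its networks from tens of thousands of RNA-seq samples with its own methodology; the gene names and threshold here are arbitrary):

```python
from math import sqrt

def pearson(x, y):
    # Pearson correlation of two equal-length expression vectors.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sqrt(sum((a - mx) ** 2 for a in x))
    vy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (vx * vy)

def coexpression_edges(expr, threshold=0.9):
    # expr: gene -> expression vector across samples. Return gene
    # pairs whose |Pearson r| meets the threshold (network edges).
    genes = sorted(expr)
    return [
        (g1, g2)
        for i, g1 in enumerate(genes)
        for g2 in genes[i + 1:]
        if abs(pearson(expr[g1], expr[g2])) >= threshold
    ]
```

"Guilt by association" then assigns candidate functions to a poorly annotated gene from the annotations of its edge partners.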

8
Movi 2: Fast and Space-Efficient Queries on Pangenomes

Zakeri, M.; Brown, N. K.; Gagie, T.; Langmead, B.

2025-10-30 genomics 10.1101/2025.10.16.682873 bioRxiv
Top 0.1%
22.8%

Space-efficient compressed indexing methods are critical for pangenomics and for avoiding reference bias. In the Movi study, we implemented the move-structure index, highlighting its locality of reference and speed. However, Movi had a high memory footprint compared to other compressed indexes. Here we introduce Movi 2 and describe new methods that greatly reduce the size and memory footprint of move-structure-based indexes. The most compressed version of Movi 2 reduces the Movi index's space footprint more than fivefold. We also introduce sampling approaches that enable trade-offs between query and space efficiency. To demonstrate, we show that Movi 2 achieves advantageous time and space tradeoffs when applied to large pangenome collections, including both the first and second releases of the Human Pangenome Reference Consortium (HPRC) collection, the latter of which spans over 460 human haplotypes. We show that Movi 2 dominates prior methods on both speed and memory footprint, including both r-index-based methods and our previous move-structure-based method. The methods we developed for Movi 2 are publicly available at https://github.com/mohsenzakeri/Movi.

9
GenVarLoader: An accelerated dataloader for applying deep learning to personalized genomics

Laub, D.; Ho, A.; Jaureguy, J.; Klie, A.; Salem, R. M.; McVicker, G.; Carter, H.

2025-01-17 genomics 10.1101/2025.01.15.633240 bioRxiv
Top 0.1%
22.8%

Deep learning sequence models trained on personalized genomics can improve variant effect prediction; however, applications of these models are limited by the computational requirements for storing and reading large datasets. We address this with GenVarLoader, which stores personalized genomic data in new memory-mapped formats with optimal data locality to achieve ~1,000x faster throughput and ~2,000x better compression compared to existing alternatives.

10
Minimizing Reference Bias with an Impute-First Approach

Vaddadi, N. S. K.; Mun, T.; Langmead, B.

2023-12-02 bioinformatics 10.1101/2023.11.30.568362 bioRxiv
Top 0.1%
22.6%

Pangenome indexes reduce reference bias in sequencing data analysis. However, bias can be reduced further by using a personalized reference, e.g. a diploid human reference constructed to match a donor individual's alleles. We present a novel impute-first alignment framework that combines elements of genotype imputation and pangenome alignment. It begins by genotyping the individual using only a subsample of the input reads. It next uses a reference panel and an efficient imputation algorithm to impute a personalized diploid reference. Finally, it indexes the personalized reference and applies a read aligner, which could be a linear or graph aligner, to align the full read set to the personalized reference. This framework achieves higher variant-calling recall (99.54% vs. 99.37%), precision (99.36% vs. 99.18%), and F1 (99.45% vs. 99.28%) compared to a graph pangenome aligner. The personalized reference is also smaller and faster to query compared to a pangenome index, making it an overall advantageous choice for whole-genome DNA sequencing experiments.

11
Robust and cost-efficient single-cell sequencing through combinatorial pooling

Gawron, J.; Cunha, L.; Borgsmueller, N.; Beerenwinkel, N.

2024-11-23 bioinformatics 10.1101/2024.11.22.624460 bioRxiv
Top 0.1%
22.6%

Single-cell sequencing is widely used to study molecular cell-to-cell heterogeneity. Even though the cost of sequencing has dropped over the last decades, single-cell assays remain expensive, because they require strategies to index molecules by cells. The high costs of indexing can be mitigated by pooling samples prior to sequencing library preparation. Computational methods have been developed to leverage molecular features that are distinct between different samples to separate the pools into distinct datasets. However, since all multiplexed samples are processed in the same way, information on the origin of each demultiplexed dataset is lost. To map datasets to their sample of origin, additional information such as molecular indexing or additional genotyping is needed. Here, we propose a class of experimental designs that allows identifying the sample of origin of each demultiplexed dataset, relying only on the genetic profiles of the samples and the composition of pools. Our approach is based on splitting and pooling samples in specific combinations. We identify the most cost-efficient experimental design in this class and prove its optimality. We present a dynamic programming algorithm to iteratively simplify an optimal experimental design by breaking it into several independent designs while maintaining optimality. Furthermore, we propose a subclass of experimental designs which allow robust sample identification even under partial failure of the experiment and present a provably optimal design in this subclass. We provide an implementation for automatic sample identification under these optimal combinatorial pooling strategies and demonstrate its functionality in a simulation study.
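One simple way to make pool composition identify samples is to give each sample a distinct subset of pools, so its presence/absence pattern across pools forms a unique code. A sketch of that idea (a simplified combinatorial design of our own, not the provably optimal designs of the paper):

```python
from itertools import combinations

def pooling_design(n_samples, n_pools, splits):
    # Assign each sample a distinct subset of `splits` pools out of
    # n_pools; the pattern of pools a sample appears in then uniquely
    # identifies its sample of origin.
    codes = list(combinations(range(n_pools), splits))
    if n_samples > len(codes):
        raise ValueError("not enough distinct pool subsets")
    return {s: set(codes[s]) for s in range(n_samples)}

def identify(design, observed_pools):
    # Map an observed pool-membership pattern back to its sample.
    matches = [s for s, pools in design.items() if pools == observed_pools]
    assert len(matches) == 1, "pattern must match exactly one sample"
    return matches[0]
```

With 4 pools and 2 splits per sample, C(4, 2) = 6 samples can be uniquely encoded; the paper's contribution is finding the cost-optimal and failure-robust designs within this kind of combinatorial family.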

12
CSCN: Inference of Cell-Specific Causal Networks Using Single-Cell RNA-Seq Data

Wang, M.; Yang, J.; Lyu, L.; Chen, J.

2025-10-09 bioinformatics 10.1101/2025.10.09.681381 bioRxiv
Top 0.1%
22.5%

Understanding gene regulation is fundamental to deciphering the coordinated activity of genes within cells. Although single-cell RNA sequencing (scRNA-seq) enables gene expression profiling at cellular resolution, most gene network inference methods operate at the tissue or population level, thereby overlooking regulatory heterogeneity across individual cells. Recent approaches, such as Cell-Specific Network (CSN) and its extension c-CSN, attempt to construct gene networks at single-cell resolution, providing a more detailed view of the regulatory logic underlying individual cellular states. However, these methods remain limited by high false positive rates due to indirect associations and lack of directionality or causal interpretability. To address these issues, we propose the Cell-Specific Causal Network (CSCN) framework, which infers directed, cell-specific gene regulatory relationships by explicitly modeling causality. CSCN combines causal discovery techniques with efficient computation using kd-trees and bitmap indexing to perform conditional independence testing, yielding sparse and interpretable causal graphs for each cell that effectively suppress indirect and spurious associations. We demonstrate through simulations that CSCN significantly reduces false positives compared to existing methods. Furthermore, we evaluate the quality of the inferred causal networks via clustering on the Causal Katz Matrix (CKM), and CSCN outperforms CSN and c-CSN in distinguishing cellular states.

13
DeepMinimizer: A Differentiable Framework for Optimizing Sequence-Specific Minimizer Schemes

Hoang, M.; Zheng, H.; Kingsford, C.

2022-02-19 bioinformatics 10.1101/2022.02.17.480870 bioRxiv
Top 0.1%
22.4%

Minimizers are k-mer sampling schemes designed to generate sketches for large sequences that preserve sufficiently long matches between sequences. Despite their widespread application, learning an effective minimizer scheme with optimal sketch size is still an open question. Most work in this direction focuses on designing schemes that work well on expectation over random sequences, which have limited applicability to many practical tools. On the other hand, several methods have been proposed to construct minimizer schemes for a specific target sequence. These methods, however, require greedy approximations to solve an intractable discrete optimization problem on the permutation space of k-mer orderings. To address this challenge, we propose: (a) a reformulation of the combinatorial solution space using a deep neural network re-parameterization; and (b) a fully differentiable approximation of the discrete objective. We demonstrate that our framework, DeepMinimizer, discovers minimizer schemes that significantly outperform state-of-the-art constructions on genomic sequences.

14
Profiling the epigenome at home

Henikoff, S.; Henikoff, J. G.

2020-04-17 genomics 10.1101/2020.04.15.043083 bioRxiv
Top 0.1%
22.2%

Chromatin accessibility mapping is a powerful approach to identify potential regulatory elements. A popular example is ATAC-seq, whereby Tn5 transposase inserts sequencing adapters into accessible DNA ("tagmentation"). CUT&Tag is a tagmentation-based epigenomic profiling method in which antibody tethering of Tn5 to a chromatin epitope of interest profiles specific chromatin features in small samples and single cells. Here we show that by simply modifying the tagmentation conditions for histone H3K4me2 or H3K4me3 CUT&Tag, antibody-tethered tagmentation of accessible DNA sites is redirected to produce chromatin accessibility maps that are indistinguishable from the best ATAC-seq maps. Thus, chromatin accessibility maps can be produced in parallel with CUT&Tag maps of other epitopes, with all steps from nuclei to amplified sequencing-ready libraries performed in single PCR tubes in the laboratory or on a home workbench. As H3K4 methylation is produced by transcription at promoters and enhancers, our method identifies transcription-coupled accessible regulatory sites.

15
Accelerating long-read analysis on modern CPUs

Kalikar, S.; Jain, C.; Md, V.; Misra, S.

2021-07-23 genomics 10.1101/2021.07.21.453294 bioRxiv
Top 0.1%
22.2%

Long-read sequencing is now routinely used at scale for genomics and transcriptomics applications. Mapping long reads or a draft genome assembly to a reference sequence is often one of the most time-consuming steps in these applications. Here, we present techniques to accelerate minimap2, a widely used software for mapping. We present multiple optimizations using SIMD parallelization, efficient cache utilization and a learned index data structure to accelerate its three main computational modules, i.e., seeding, chaining and pairwise sequence alignment. These reduce minimap2's end-to-end mapping time by up to 1.8x while maintaining identical output.

16
Robust and Accurate Doublet Detection of Single-Cell Sequencing Data via Maximizing Area Under Precision-Recall Curve

CHEN, Y.; Wu, X.; Ni, K.; Hu, H.; Yue, M.; Chen, W.; Huang, H.

2023-11-02 bioinformatics 10.1101/2023.10.30.564840 bioRxiv
Top 0.1%
22.0%

Single-cell sequencing has revolutionized our understanding of cellular heterogeneity by offering detailed profiles of individual cells within diverse specimens. However, due to the limitations of sequencing technology, two or more cells may be captured in the same droplet and share the same barcode. These incidents, termed doublets or multiplets, can lead to artifacts in single-cell data analysis. While explicit experimental design can mitigate these issues with the help of auxiliary cell markers, computationally annotating doublets has a broad impact on analyzing the existing public single-cell data and reduces potential experimental costs. Considering that doublets form only a minor fraction of the total dataset, we argue that current doublet detection methods, primarily focused on optimizing classification accuracy, might be inefficient in performing well on the inherently imbalanced data in the area under the precision-recall curve (AUPRC) metric. To address this, we introduce RADO (Robust and Accurate DOublet detection) - an algorithm designed to annotate doublets by maximizing the AUPRC, effectively tackling the imbalance challenge. Benchmarked on 18 public datasets, RADO outperforms other methods in terms of doublet score and achieves similar performance to the current best methods in doublet calling. Furthermore, beyond its application in single-cell RNA-seq data, we demonstrate RADO's adaptability to single-cell assays for transposase-accessible chromatin sequencing (scATAC-seq) data, where it outperforms other scATAC-seq doublet detection methods. RADO's open-source implementation is available at: https://github.com/poseidonchan/RADO.
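The AUPRC objective itself can be evaluated with the average-precision estimator: sum the precision at each true positive in descending score order, then divide by the number of positives. A small sketch of the metric (the evaluation target, not RADO's training procedure):

```python
def average_precision(labels, scores):
    # Average precision, an estimator of the area under the
    # precision-recall curve: labels are 0/1, scores are rankings.
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    tp = fp = 0
    total_pos = sum(labels)
    ap = 0.0
    for i in order:
        if labels[i]:
            tp += 1
            ap += tp / (tp + fp)   # precision at this true positive
        else:
            fp += 1
    return ap / total_pos
```

Unlike accuracy, this metric is insensitive to the large pool of true negatives, which is why it suits rare-event problems like doublet detection where positives are a small minority.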

17
LRMD: Reference-Free Misassembly Detection Based on Multiple Features from Long-Read Alignments

Wang, J.; Nie, F.; Shi, X.

2025-11-08 genomics 10.1101/2025.11.07.686952 bioRxiv
Top 0.1%
22.0%

Genome assembly serves as the cornerstone of genomics research, with the detection of misassemblies playing a crucial role in downstream analyses. Reference-free methods for misassembly detection, leveraging read alignments, circumvent the need for high-quality reference genomes and thus have broad applicability. However, existing methods struggle to effectively utilize alignment data, leading to a noticeable deficiency in sensitivity for detecting misassemblies. We introduce LRMD, a novel reference-free tool for misassembly detection. LRMD integrates depth, clipping, and read pileup information derived from long-read-to-assembly alignments to significantly enhance sensitivity in identifying misassemblies. Experimental evaluations on both simulated and real datasets demonstrate that LRMD consistently outperforms existing tools in terms of sensitivity and F1-score. Notably, its results are closest to those of the reference-based evaluation tool QUAST. As an evaluation tool, LRMD also outputs metrics such as base quality, assembly size, contig N50, and others. LRMD is publicly available at http://github.com/sxfss/LRMD.

18
Generation and analysis of a mouse multi-tissue genome annotation atlas

Adams, M. S.; Vollmers, C.

2024-02-01 genomics 10.1101/2024.01.31.578267 bioRxiv
Top 0.1%
22.0%

Generating an accurate and complete genome annotation for an organism is complex because the cells within each tissue can express a unique set of transcript isoforms from a unique set of genes. A comprehensive genome annotation should contain information on what tissues express what transcript isoforms at what level. This tissue-level isoform information can then inform a wide range of research questions as well as experiment designs. Long-read sequencing technology combined with advanced full-length cDNA library preparation methods has now achieved throughput and accuracy where generating these types of annotations is achievable. Here, we show this by generating a genome annotation of the mouse (Mus musculus). We used the nanopore-based R2C2 long-read sequencing method to generate 64 million highly accurate full length cDNA consensus reads - averaging 5.4 million reads per tissue for a dozen tissues. Using the Mandalorion tool we processed these reads to generate the Tissue-level Atlas of Mouse Isoforms (TAMI - available at https://genome.ucsc.edu/s/vollmers/TAMI) which we believe will be a valuable complement to conventional, manually curated reference genome annotations.

19
Pan-cell type continuous chromatin state annotation of all IHEC epigenomes

Daneshpajouh, H.; Moghul, I.; Wiese, K. C.; Libbrecht, M. W.

2025-02-08 genomics 10.1101/2025.02.06.636950 bioRxiv
Top 0.1%
21.9%

The International Human Epigenome Consortium has generated thousands of epigenomic datasets that measure various biochemical activities in the genome, including transcription factor binding, histone modification, and DNA accessibility. Currently, the predominant methods for integrating these datasets to annotate regulatory elements are segmentation and genome annotation (SAGA) algorithms. The majority of annotations by these methods are cell type-specific. However, as the number of profiled cell types has grown into the thousands, using thousands of cell type-specific chromatin state annotations proves undesirable for many applications. Here, we present a pan-cell type annotation that summarizes all IHEC epigenomes using the recently-developed method, epigenome-ssm.

20
sc-REnF: An entropy guided robust feature selection for clustering of single-cell rna-seq data

Lall, S.; Ghosh, A.; Ray, S.; Bandyopadhyay, S.

2020-10-10 bioinformatics 10.1101/2020.10.10.334573 bioRxiv
Top 0.1%
21.8%

Many single-cell typing methods require pure clustering of cells, which is susceptible to technical noise and heavily dependent on the high-quality informative genes selected in the preliminary steps of downstream analysis. Techniques for gene selection in single-cell RNA sequencing (scRNA-seq) data are often simplistic, which causes problems for the resolution of (sub)type detection and marker selection, and ultimately affects cell annotation. We introduce sc-REnF, a novel and robust entropy-based feature (gene) selection method that leverages the established advantages of Renyi and Tsallis entropy for single-cell clustering. Gene selection thereby becomes robust and less sensitive to the technical noise present in the data, producing a pure clustering of cells and classifying independent, unknown samples with high accuracy. The corresponding software is available at: https://github.com/Snehalikalall/sc-REnF
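The Renyi family of entropies, which recovers Shannon entropy as α → 1, underlies sc-REnF's selection criteria. A sketch of the measure itself (the definition only, not the sc-REnF gene-selection pipeline):

```python
from math import log

def renyi_entropy(probs, alpha):
    # Renyi entropy H_alpha(p) = log(sum_i p_i^alpha) / (1 - alpha).
    # The limit alpha -> 1 recovers Shannon entropy, handled as a
    # special case here.
    if alpha == 1:
        return -sum(p * log(p) for p in probs if p > 0)
    return log(sum(p ** alpha for p in probs if p > 0)) / (1 - alpha)
```

Varying α changes how strongly rare versus common expression states weigh in, which is the tunable robustness such entropy-guided selection exploits; for a uniform distribution every α gives the same value, log of the number of states.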